Our primary goals in our research and analysis are to boost Regork's sales and increase operational effectiveness. We have selected three problem statements- relating to the optimal package sizes to position in our business, to the best way to issue coupons to consumers in order to maximize sales, and to atttain maximum coupon redemption rate for more profitable marketing.
We worked on the problem statements and analysed the existing ‘CompleteJourney’ datasets relating to transactions, products, and demographics. We conducted the explorartory data analysis on the datasets to address the problem statements and derive meaningful insights from the data sets like Transactions, Demographics and Product.
Is the retail discount distribution in relation to the age groups ideal?Are discounts yielding higher sales?
For this after cleaning the data we have grouped the ages into 4 groups 19-34, 35-44, 45-54 and 55+ and formed a new data frame called "demo_tran". Then filtering the sales data based up on the retail discount and plotting the required graphs for generating insights addressing the business problem.
We found the below insights after peforming the analysis from the graphs:
from completejourney_py import get_data
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import calendar
df = px.data.tips()
import warnings
warnings.filterwarnings('ignore')
transactions = get_data()['transactions']
products = get_data()['products']
promotions = get_data()['promotions']
demographics = get_data()['demographics']
transactions = get_data()['transactions']
campaigns = get_data()['campaigns']
coupons = get_data()['coupons']
campaign_descriptions = get_data()['campaign_descriptions']
coupon_redemptions = get_data()['coupon_redemptions']
demographics.head()
| household_id | age | income | home_ownership | marital_status | household_size | household_comp | kids_count | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 65+ | 35-49K | Homeowner | Married | 2 | 2 Adults No Kids | 0 |
| 1 | 1001 | 45-54 | 50-74K | Homeowner | Unmarried | 1 | 1 Adult No Kids | 0 |
| 2 | 1003 | 35-44 | 25-34K | None | Unmarried | 1 | 1 Adult No Kids | 0 |
| 3 | 1004 | 25-34 | 15-24K | None | Unmarried | 1 | 1 Adult No Kids | 0 |
| 4 | 101 | 45-54 | Under 15K | Homeowner | Married | 4 | 2 Adults Kids | 2 |
demo_tran = demographics.merge(transactions, on = ['household_id'], how = 'inner')
demo_tran = demo_tran[~demo_tran.isnull().any(axis=1)]
demo_tran.head()
| household_id | age | income | home_ownership | marital_status | household_size | household_comp | kids_count | store_id | basket_id | product_id | quantity | sales_value | retail_disc | coupon_disc | coupon_match_disc | week | transaction_timestamp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 65+ | 35-49K | Homeowner | Married | 2 | 2 Adults No Kids | 0 | 436 | 31317046240 | 823721 | 1 | 2.99 | 0.0 | 0.0 | 0.0 | 2 | 2017-01-07 18:55:24 |
| 1 | 1 | 65+ | 35-49K | Homeowner | Married | 2 | 2 Adults No Kids | 0 | 436 | 31317046240 | 832990 | 2 | 5.98 | 0.0 | 0.0 | 0.0 | 2 | 2017-01-07 18:55:24 |
| 2 | 1 | 65+ | 35-49K | Homeowner | Married | 2 | 2 Adults No Kids | 0 | 436 | 31317046240 | 854920 | 1 | 1.49 | 0.0 | 0.0 | 0.0 | 2 | 2017-01-07 18:55:24 |
| 3 | 1 | 65+ | 35-49K | Homeowner | Married | 2 | 2 Adults No Kids | 0 | 436 | 31317046240 | 856942 | 1 | 2.99 | 0.0 | 0.0 | 0.0 | 2 | 2017-01-07 18:55:24 |
| 4 | 1 | 65+ | 35-49K | Homeowner | Married | 2 | 2 Adults No Kids | 0 | 436 | 31317046240 | 868401 | 1 | 0.59 | 0.0 | 0.0 | 0.0 | 2 | 2017-01-07 18:55:24 |
demo_tran['age'].value_counts()
45-54 189439 35-44 150652 25-34 77813 65+ 46000 55-64 40091 19-24 14978 Name: age, dtype: int64
conditions = [
((demo_tran['age'] == "19-24") | (demo_tran['age'] == "25-34")),
((demo_tran['age'] == "35-44")),
((demo_tran['age'] == "45-54")),
((demo_tran['age'] == "55-64") | (demo_tran['age'] == "65+"))
]
# create a list of the values we want to assign for each condition
values = ['19-34', '35-44', '45-54', '55+']
demo_tran['age_group'] = np.select(conditions, values)
demo_tran['age_group'].value_counts()
45-54 189439 35-44 150652 19-34 92791 55+ 86091 Name: age_group, dtype: int64
demo_tran1 = demo_tran.copy()
demo_tran1 = demo_tran1[(demo_tran1['retail_disc'] > 0) & (demo_tran1['retail_disc'] < 1) ].groupby('age_group')\
.aggregate({'sales_value':'sum', 'retail_disc':'sum'})
demo_tran1['percentage'] = demo_tran1['retail_disc']*100/demo_tran1['sales_value']
demo_tran1 = demo_tran1.reset_index()
demo_tran1.head()
| age_group | sales_value | retail_disc | percentage | |
|---|---|---|---|---|
| 0 | 19-34 | 71387.40 | 14601.23 | 20.453511 |
| 1 | 35-44 | 121789.53 | 24400.26 | 20.034776 |
| 2 | 45-54 | 139258.63 | 29145.40 | 20.928972 |
| 3 | 55+ | 60278.84 | 12775.25 | 21.193590 |
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter(x=demo_tran1.age_group, y=demo_tran1.percentage, name="Percent discount used"),
secondary_y=True,
)
fig.add_trace(
go.Bar(x=demo_tran1['age_group'], y=demo_tran1['sales_value'], name = 'Total Sales'),
secondary_y=False,
)
fig.add_trace(
go.Bar(x=demo_tran1['age_group'], y=demo_tran1['retail_disc'], name = 'Total Discount'),
secondary_y=False,
)
# Add figure title
fig.update_layout(
title_text="Retail Discount (for less than $1) v/s Total Sales Revenue (by Age Group)"
)
fig.update_yaxes(range=[0,40], secondary_y=True)
# Set x-axis title
fig.update_xaxes(title_text="Age Group")
# Set y-axes titles
fig.update_yaxes(title_text="Total Sales Value", secondary_y=False)
fig.update_yaxes(title_text="Total percentage of discount used", secondary_y=True)
fig.show()
Which package sizes generate the highest sales for the top 5 grocery products. To understand which package size to focus while procuring from manufacturer.
In order to generate insights, we first cleaned the data and identified various product categories in Grocery. To determine which package size is generating maximum sales, we then analysed those products by sales along with their package sizes to determine which package size generates the most revenue.
Based on the graph analysis, we discovered the following insights:
i. Drinking water and soft drinks packaged in 12 ounce portions generate the best sales.
ii.Soft drinks is the top product based on sales value.
Below is the code and plots for the same:
products_filter = products[products['department']=="GROCERY"]
product_transactions = products_filter.merge(transactions, on = ['product_id'], how = 'inner')
products_transactions_oz = product_transactions[product_transactions['package_size'].str.contains("OZ", na = False)]
products_transactions_lb = product_transactions[product_transactions['package_size'].str.contains("LB", na = False)]
products_transactions_oz_groupby = products_transactions_oz.groupby(['product_id','product_category','package_size'])\
.aggregate({'sales_value':'sum'}).reset_index()\
.sort_values('sales_value', ascending=False).head(10)
unidff = products_transactions_oz_groupby.groupby('product_category').first().reset_index().sort_values('sales_value', ascending=False)
unidff['package_size'] = unidff.package_size.str.extract('(\d+)')
unidff["package_size"] = pd.to_numeric(unidff["package_size"])
unidff = unidff.rename(columns = {'package_size':'package_size_in_ounces'})
# -----------------------------
products_transactions_lb_groupby = products_transactions_lb.groupby(['product_id','product_category','package_size'])\
.aggregate({'sales_value':'sum'}).reset_index()\
.sort_values('sales_value', ascending=False).head(7)
unidff_lb = products_transactions_lb_groupby.groupby('product_category').first().reset_index().sort_values('sales_value', ascending=False)
unidff_lb['package_size'] = unidff_lb.package_size.str.extract('(\d+)')
unidff_lb["package_size"] = pd.to_numeric(unidff_lb["package_size"])
unidff_lb = unidff_lb.rename(columns = {'package_size':'package_size_in_pounds'})
labels = unidff.product_category
max_sales = unidff.sales_value
package_size_in_ounces = unidff.package_size_in_ounces
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots()
ax2 = ax.twinx()
rects1 = ax.bar(x - width/2, max_sales, width, label='max_sales', color = 'purple')
rects2 = ax2.bar(x + width/2, package_size_in_ounces, width, label='package_size_in_ounces', color = 'yellow')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Max sales')
ax2.set_ylabel('Package Size in ounces')
ax.set_title('Max sales and package size in ounces')
ax.set_xticks(x, labels)
ax.legend()
ax2.legend()
fig.autofmt_xdate(rotation=45)
plt.show()
prod_chart = unidff.sort_values(by = ['sales_value'],
ascending = False).iloc[0:15].reset_index()
fig1 = go.Figure()
# Draw points
fig1.add_trace(go.Scatter(x = prod_chart["sales_value"],
y = prod_chart["product_category"],
mode = 'markers',
marker_color ='darkblue',
marker_size = 10))
# Draw lines
for i in range(0, len(prod_chart)):
fig1.add_shape(type='line',
x0 = 0, y0 = i,
x1 = prod_chart["sales_value"][i],
y1 = i,
line=dict(color='crimson', width = 3))
fig1.layout.hovermode = False
fig1.update_layout(
title_text="Top 5 product categories of Groceries having maximum sales<br><sup>This data has been collected over the year 2017</sup>"
)
fig1.show()
To analyse campaign performance based on coupons redemption rates, total sales generated by the products (which were part of the campaign), households targeted, ages and income ranges of the customers.
We utilized the completejourney dataset to identify trends and patterns in the data and generated insights to help the leadership make effective data-driven decisions.
Throughout the analysis, we identify the best-case practices and patterns in certain campaigns and suggest areas of improvement to simulate the poorer performing campaigns.
We analyzed the campaign performance based on the redemption rates, total sales generated by the products (which were part of the campaign), households targeted, ages and income ranges of the customers.
Insights from the Graphs
campaign_descriptions.head()
| campaign_id | campaign_type | start_date | end_date | |
|---|---|---|---|---|
| 0 | 1 | Type B | 2017-03-03 | 2017-04-09 |
| 1 | 2 | Type B | 2017-03-08 | 2017-04-09 |
| 2 | 3 | Type C | 2017-03-13 | 2017-05-08 |
| 3 | 4 | Type B | 2017-03-29 | 2017-04-30 |
| 4 | 5 | Type B | 2017-04-03 | 2017-05-07 |
coupon_redemptions.head()
| household_id | coupon_upc | campaign_id | redemption_date | |
|---|---|---|---|---|
| 0 | 1029 | 51380041013 | 26 | 2017-01-01 |
| 1 | 1029 | 51380041313 | 26 | 2017-01-01 |
| 2 | 165 | 53377610033 | 26 | 2017-01-03 |
| 3 | 712 | 51380041013 | 26 | 2017-01-07 |
| 4 | 712 | 54300016033 | 26 | 2017-01-07 |
campaigns.head()
| campaign_id | household_id | |
|---|---|---|
| 0 | 1 | 105 |
| 1 | 1 | 1238 |
| 2 | 1 | 1258 |
| 3 | 1 | 1483 |
| 4 | 1 | 2200 |
coupons.head()
| coupon_upc | product_id | campaign_id | |
|---|---|---|---|
| 0 | 10000085207 | 9676830 | 26 |
| 1 | 10000085207 | 9676943 | 26 |
| 2 | 10000085207 | 9676944 | 26 |
| 3 | 10000085207 | 9676947 | 26 |
| 4 | 10000085207 | 9677008 | 26 |
# Compute campaign products total sales value for different campaign types
camp_prod_tsv = coupons[['campaign_id', 'product_id']] \
.merge(campaign_descriptions[['campaign_id', 'campaign_type']], on='campaign_id') \
.merge(transactions[['product_id', 'transaction_timestamp', 'sales_value']], on='product_id') \
.groupby('campaign_type') \
.agg(total_sales=('sales_value', 'sum')) \
.reset_index() \
.sort_values('campaign_type')
camp_prod_tsv
| campaign_type | total_sales | |
|---|---|---|
| 0 | Type A | 6817835.35 |
| 1 | Type B | 623471.96 |
| 2 | Type C | 242966.70 |
# Households Targeted by campaigns
camp_hh_targeted = campaigns.groupby('campaign_id', as_index=False) \
.agg(hh_targeted=('household_id', 'nunique')) \
.sort_values('campaign_id') \
.merge(campaign_descriptions, on='campaign_id') \
.loc[:, ['campaign_type', 'campaign_id', 'hh_targeted']] \
camp_hh_targeted_type = camp_hh_targeted.groupby('campaign_type', as_index=False) \
.agg(total_hh_targeted =('hh_targeted', 'sum'))
camp_hh_targeted_type
| campaign_type | total_hh_targeted | |
|---|---|---|
| 0 | Type A | 3647 |
| 1 | Type B | 2368 |
| 2 | Type C | 574 |
# Coupons redemption by households
camp_hh_redeemed = campaigns[['campaign_id']] \
.drop_duplicates() \
.merge(coupon_redemptions[['campaign_id', 'household_id']].drop_duplicates(), on='campaign_id', how='left') \
.groupby('campaign_id', as_index=False) \
.agg(hh_redeemed=('household_id', 'nunique')) \
.sort_values('campaign_id') \
.merge(campaign_descriptions[['campaign_id', 'campaign_type']], on='campaign_id') \
.assign(campaign_id=lambda x: pd.to_numeric(x['campaign_id'])) \
.loc[:, ['campaign_id', 'campaign_type', 'hh_redeemed']]
camp_hh_redeemed_type = camp_hh_redeemed.groupby('campaign_type', as_index=False) \
.agg(total_hh_redeemed =('hh_redeemed', 'sum'))
camp_hh_redeemed_type
| campaign_type | total_hh_redeemed | |
|---|---|---|
| 0 | Type A | 604 |
| 1 | Type B | 154 |
| 2 | Type C | 34 |
# Compute Campaign performance
camp_perf_df = camp_hh_targeted.merge(camp_hh_redeemed, on='campaign_id', how='inner') \
.sort_values('campaign_id') \
.assign(redemption_rate=lambda x: (x['hh_redeemed']/x['hh_targeted'])*100) \
.rename(columns={'campaign_type_x': 'campaign_type'}) \
.loc[:, ['campaign_id', 'campaign_type', 'hh_targeted', 'hh_redeemed', 'redemption_rate']] \
.sort_values(by='redemption_rate', ascending=False)
camp_perf_df
| campaign_id | campaign_type | hh_targeted | hh_redeemed | redemption_rate | |
|---|---|---|---|---|---|
| 17 | 18 | Type A | 1133 | 214 | 18.887908 |
| 12 | 13 | Type A | 1077 | 196 | 18.198700 |
| 2 | 3 | Type C | 12 | 2 | 16.666667 |
| 7 | 8 | Type A | 1076 | 158 | 14.684015 |
| 14 | 15 | Type C | 17 | 2 | 11.764706 |
| 18 | 19 | Type B | 130 | 15 | 11.538462 |
| 8 | 9 | Type B | 176 | 20 | 11.363636 |
| 25 | 26 | Type B | 118 | 12 | 10.169492 |
| 15 | 16 | Type B | 188 | 19 | 10.106383 |
| 26 | 27 | Type A | 361 | 36 | 9.972299 |
| 16 | 17 | Type B | 202 | 18 | 8.910891 |
| 9 | 10 | Type B | 123 | 10 | 8.130081 |
| 13 | 14 | Type C | 224 | 18 | 8.035714 |
| 0 | 1 | Type B | 13 | 1 | 7.692308 |
| 3 | 4 | Type B | 81 | 6 | 7.407407 |
| 11 | 12 | Type B | 170 | 11 | 6.470588 |
| 24 | 25 | Type B | 17 | 1 | 5.882353 |
| 21 | 22 | Type B | 276 | 15 | 5.434783 |
| 4 | 5 | Type B | 166 | 8 | 4.819277 |
| 19 | 20 | Type C | 244 | 11 | 4.508197 |
| 1 | 2 | Type B | 48 | 2 | 4.166667 |
| 20 | 21 | Type B | 65 | 2 | 3.076923 |
| 10 | 11 | Type B | 214 | 6 | 2.803738 |
| 6 | 7 | Type B | 198 | 5 | 2.525253 |
| 22 | 23 | Type B | 183 | 3 | 1.639344 |
| 5 | 6 | Type C | 65 | 1 | 1.538462 |
| 23 | 24 | Type C | 12 | 0 | 0.000000 |
# Compute max campaign ID
max_camp_df = campaigns[['campaign_id']].astype({'campaign_id': 'int'}).agg({'campaign_id': 'max'}).reset_index()
maximum_campaign_id = max_camp_df[0][0]
maximum_campaign_id
27
# Campaign performance vs campaign type scatter plot
sns.set(rc={'figure.figsize':(11.7,8.27)})
plot= sns.catplot(data=camp_perf_df, x="campaign_id", y="redemption_rate", col = 'campaign_type', hue= 'campaign_type')
plot.set_axis_labels("campaign_id", "redemption_rate")
plot.fig.subplots_adjust(top=0.8)
plot.fig.suptitle("Overall Campaign Performance")
for ax in plot.axes.flat:
labels = ax.get_xticklabels() # get x labels
for i,l in enumerate(labels):
if(i%4 != 0): labels[i] = '' # make every 4th label visible
ax.set_xticklabels(labels) # set new labels
plt.show()
sns.despine(left=True)
<Figure size 842.4x595.44 with 0 Axes>
# Coupon Redemption Of Household – By Income Range
demo_income = pd.merge(demographics, coupon_redemptions, on='household_id', how='left')
demo_income['redemption'] = demo_income['redemption_date'].apply(lambda x: 'Never Used a promotion' if pd.isna(x) else 'Used a Promotion')
demo_income = demo_income[['age', 'income', 'redemption']]
demo_income = demo_income.groupby(['income', 'redemption']).size().reset_index(name='hh_count')
demo_income
| income | redemption | hh_count | |
|---|---|---|---|
| 0 | 100-124K | Never Used a promotion | 20 |
| 1 | 100-124K | Used a Promotion | 46 |
| 2 | 125-149K | Never Used a promotion | 19 |
| 3 | 125-149K | Used a Promotion | 98 |
| 4 | 15-24K | Never Used a promotion | 57 |
| 5 | 15-24K | Used a Promotion | 75 |
| 6 | 150-174K | Never Used a promotion | 13 |
| 7 | 150-174K | Used a Promotion | 115 |
| 8 | 175-199K | Never Used a promotion | 6 |
| 9 | 175-199K | Used a Promotion | 21 |
| 10 | 200-249K | Never Used a promotion | 4 |
| 11 | 200-249K | Used a Promotion | 4 |
| 12 | 25-34K | Never Used a promotion | 50 |
| 13 | 25-34K | Used a Promotion | 129 |
| 14 | 250K+ | Never Used a promotion | 9 |
| 15 | 250K+ | Used a Promotion | 3 |
| 16 | 35-49K | Never Used a promotion | 116 |
| 17 | 35-49K | Used a Promotion | 234 |
| 18 | 50-74K | Never Used a promotion | 105 |
| 19 | 50-74K | Used a Promotion | 593 |
| 20 | 75-99K | Never Used a promotion | 65 |
| 21 | 75-99K | Used a Promotion | 247 |
| 22 | Under 15K | Never Used a promotion | 42 |
| 23 | Under 15K | Used a Promotion | 104 |
# Plot Households Redeeming Promotions by Income Range
demo_income_plot = sns.barplot(x='hh_count', y='income', hue='redemption', data=demo_income, palette=['#E69F00', '#56B4E9'])
demo_income_plot.set(title='Households Redeeming Promotions by Income Range', xlabel='Number of Households', ylabel='Income Range')
demo_income_plot.legend(title='Coupon Redemption?')
demo_income_plot.figure.subplots_adjust(left=0.3)
demo_income_plot.figure.set_size_inches(8, 6)
demo_income_plot.set_axisbelow(True)
# Coupon Redemptions Of Household – By Age
demo_age = pd.merge(demographics, coupon_redemptions, on='household_id', how='left')
demo_age['redemption'] = demo_age['redemption_date'].apply(lambda x: 'Never Used a promotion' if pd.isna(x) else 'Used a Promotion')
demo_age = demo_age[['age', 'income', 'redemption']]
demo_age = demo_age.groupby(['age', 'redemption']).size().reset_index(name='hh_count')
demo_age
| age | redemption | hh_count | |
|---|---|---|---|
| 0 | 19-24 | Never Used a promotion | 37 |
| 1 | 19-24 | Used a Promotion | 40 |
| 2 | 25-34 | Never Used a promotion | 92 |
| 3 | 25-34 | Used a Promotion | 329 |
| 4 | 35-44 | Never Used a promotion | 123 |
| 5 | 35-44 | Used a Promotion | 412 |
| 6 | 45-54 | Never Used a promotion | 170 |
| 7 | 45-54 | Used a Promotion | 664 |
| 8 | 55-64 | Never Used a promotion | 36 |
| 9 | 55-64 | Used a Promotion | 123 |
| 10 | 65+ | Never Used a promotion | 48 |
| 11 | 65+ | Used a Promotion | 101 |
# Plot Households Redeeming Promotions by Age
plt.figure(figsize=(8,6))
sns.barplot(x='hh_count', y='age', hue='redemption', data=demo_age, orient='horizontal', alpha=1)
plt.xlabel('Number of Households')
plt.ylabel('Age Group')
plt.title('Number of Households Redeeming Promotions by Age')
plt.legend(title='Coupon Redemption?')
plt.show()
Summary
Limitations
Future Scope